Writing a multi-stage compiler

I write this post as I am in the process of rewriting my self-hosted compiler. I chose to separate the compiler stages cleanly, so that for syntax highlighting, the tokeniser and parser stages can be used as a stand-alone component. However, I had previously rewritten the whole AST from scratch for every new compiler stage (parser, scoper, resolver, template instantiator), which made the code hard to maintain.

I now follow a new pattern using templates: I write a generic implementation of the AST, and all of its types take a common template argument, which injects compiler stage specific functionality and types. Here's an example of how to write a stage-dependent namespace-qualified symbol type (simplified):

/// The stage-dependent symbol implementation.
::ast [Stage: TYPE] Symbol
{
    Child
    {
        Name: Stage::Name;
        Position: src::Position;
    }

    Children: Child - std::Vector;
    IsRoot: BOOL;
}

/// The actual parser stage.
::parser ParserStage
{
    TYPE Previous := :nothing;
    TYPE Context := src::File \;

    // Expose ast::Symbol as Stage::Symbol.
    TYPE Symbol := ast::[ParserStage]Symbol;
}

Here, we see a simplified symbol type implementation that depends on the stage it is in (Stage::Name might be a string or a direct reference to a node in the AST, depending on the stage). Additionally, we see how the stage type itself is defined, exposing stage-specific types under standardised names (ParserStage::Symbol). This pattern is fairly straightforward, and as is already hinted at in the code example above, a Stage::Previous reference to the previous stage and Stage::Context type can be used to make the whole stage flow completely abstract and modular. If we have a generic function as follows, we can simply transform one stage's AST into the next stage's AST without having to reimplement any of the general workflow:

[Stage: TYPE] transform(previous: Stage::Previous #&, context: Stage::Context) Stage

The only thing we need is to configure the separate stage types and the stage-specific types to be constructible from their counterparts in the previous stage. So, the resolver stage's symbol type no longer needs to be a segmented path name, but can directly point to the resolved entity name, and maybe also contain the resolved template arguments. The instantiator stage would then further resolve symbols in its representation by directly pointing to the templated instantiations of types and functions.

These changes in the data model between different stages are easy to manage, as only the specific types in the AST that change need to be adapted and the generalised stage workflow will take care of the rest. This technique applies to all multi-stage data transformations as long as the different stages still use the same general data view.